N26 Forecasting Task

A couple of points to note:

- I have created a requirements.txt file from the virtual environment I used for this project. Note that I used Python version 3.10.0; I believe you will run into errors with auto arima in later Python versions, so please set up your environment accordingly if you wish to replicate my code

- I never got time to move my models out into functions in separate .py files, which is what I would do in a production setting

- For some of the models at the end I have manually set the seasonal term P in my SARIMA model to 1, ignoring the output from auto arima. This is because I can clearly see at least an order-1 seasonal 7-day pattern in those cases

- Some of the code at the bottom repeats itself quite often. In a production setting I would rewrite this using DRY code once I identified the best models from experimentation, and separate it out into .py files. Apologies if this makes reading through the notebook an arduous task

1. Data Preparation

In the data preparation step I load the data, check for nulls, duplicates and outliers, and set the optimal Python data types.
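As a minimal sketch of these checks, on a hypothetical transactions frame (the column names and values are illustrative, not the real schema):

```python
import pandas as pd

# Hypothetical transactions frame standing in for the real data;
# columns and values are illustrative only
base = {
    "amount": [10.5, 10.5, 200.0, 35.0],
    "direction": ["in", "out", "in", "out"],
    "mcc_category": ["groceries", "transport", "salary", "groceries"],
}
df = pd.DataFrame({k: v * 50 for k, v in base.items()})  # 200 rows

# Nulls and exact duplicate rows
null_counts = df.isna().sum()
n_duplicates = int(df.duplicated().sum())
deduped = df.drop_duplicates()

# Leaner dtypes: low-cardinality strings -> category, floats downcast
before = df.memory_usage(deep=True).sum()
df["direction"] = df["direction"].astype("category")
df["mcc_category"] = df["mcc_category"].astype("category")
df["amount"] = pd.to_numeric(df["amount"], downcast="float")
after = df.memory_usage(deep=True).sum()
```

The category/downcast step is what shrinks the frame; with mostly repeated string values the deep memory usage drops sharply.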

Notice that the size of the data is almost a quarter of the original dataframe. I debated setting the direction column to boolean but decided against it, because in reality a third option of neither in nor out is possible in the data.

In the previous step I wanted to check whether there are duplicates, and you will notice that there are. However, given the nature of the data it is perfectly possible that customers will make the same transaction more than once; without a unique identifier for each transaction I cannot distinguish between repeat transactions and duplicates with 100% certainty. I received confirmation from the data owner that I can remove them. In the real world I would ensure a transaction id is added, especially if I were building a solution that will later go into production.

There are a significant number of outliers in this data, as you can see from the box and whisker plot. However, given the nature of what the data represents, you would expect a Pareto distribution in both income and outgoings, and there isn't anything that is really outlandish.
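The whisker rule behind that plot can be sketched on synthetic Pareto-distributed amounts (the shape parameter and scale here are illustrative): points beyond Q3 + 1.5 × IQR are flagged, and a heavy right tail produces plenty of them without any being errors.

```python
import numpy as np

rng = np.random.default_rng(42)
# Pareto-like amounts: heavy right tail, as expected for transaction data
amounts = rng.pareto(a=2.0, size=1000) * 100

# Standard box-and-whisker outlier rule
q1, q3 = np.percentile(amounts, [25, 75])
iqr = q3 - q1
upper = q3 + 1.5 * iqr
n_outliers = int((amounts > upper).sum())
share = n_outliers / len(amounts)
```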

2. Exploratory Data Analysis

In this step I explore the data column by column in order to gain a better understanding of the data. This knowledge is then used to inform some of the decisions made when building my forecasting model.

Looking at the data I can see a couple of things immediately: visually there doesn't appear to be a trend, but there does appear to be seasonality in the data; it looks like there is a weekly pattern.
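One quick way to check for a weekly pattern is to average by day of week. Here is a sketch on a synthetic daily series with a built-in weekly shape (all values are illustrative): a large spread across the seven means suggests weekly seasonality.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
idx = pd.date_range("2022-01-01", periods=140, freq="D")
# Synthetic daily totals with a weekly shape and no trend
weekly_shape = np.array([80, 120, 110, 100, 105, 150, 60], dtype=float)
y = weekly_shape[idx.dayofweek] + rng.normal(0, 5, size=len(idx))
s = pd.Series(y, index=idx)

# Average by day of week: a pronounced spread suggests weekly seasonality
dow_means = s.groupby(s.index.dayofweek).mean()
spread = dow_means.max() - dow_means.min()
```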

In both income and expenditure I can see that the real month-by-month changes occur in the top quartile and beyond.

This is an important piece of information: income arrives almost entirely via credit transfers, so I will likely just forecast income for August as a whole; it doesn't make much sense to split the forecast.

Before I go into modelling I want to build time series plots for some of the categories, to see if I can find patterns. This will inform my decision on how to model expenditure.

For expenditure, based on the charts I think it makes sense to model debit transfers and direct debits separately. I will investigate whether it is worth breaking down the presentment type further.

With the exception of groceries I can't really see the usefulness of splitting presentment down further. I think there is a clear path forward for modelling: income forecast as a whole, and expenditure forecast in four groups — debit transfers, direct debits, groceries, and everything else.

3. Time Series Forecasting

The next step after exploratory data analysis is to build my forecasting model. I have decided to build my model with data aggregated into incoming and outgoing flows. There were a couple of reasons for doing this, some of which I alluded to above:

- For income this was a particularly easy decision: since mcc categories are irrelevant, the only potential split would be on transaction types, but since almost all income is in the form of credit transfers it makes no sense to do this

- For outgoings there is potentially more reason to consider splitting the forecast, and I have decided to split the forecast for expenditure as outlined above

I would like to note here that I do think these category columns are extremely useful pieces of data, particularly if we are interested in the behaviour of our customers and want to separate different types of users. For this reason, if I have time I will add a clustering section to show what else can be done with this data.

The plots support the evidence suggesting a seasonal component in the time series. It looks like there is a 7-day cycle in the data: there appears to be statistically significant autocorrelation at 7, 14, 21, etc. day intervals.

Again, the ETS decomposition clearly shows seasonality in the data; now we can move on to actually building the model. Note that the decomposition shows the monthly spike in the trend chart and the weekly cycle in the seasonal chart.

auto arima recommends a seasonal ARIMA model

The models look reasonably okay, however they aren't accounting for the spikes that typically occur towards the end and beginning of the month. I therefore think we can improve the model by including exogenous variables.

The model is hitting the spike at the end but is over-forecasting. I think this is because of the way I have put the four days into my model; I will try again, entering them as separate columns.
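Entering the days as separate 0/1 columns can be sketched as follows; the specific days of the month flagged here are illustrative, not necessarily the four actually used in the model. One column per day lets each get its own coefficient, instead of one shared effect from a single combined flag.

```python
import pandas as pd

idx = pd.date_range("2022-01-01", "2022-07-31", freq="D")
exog = pd.DataFrame(index=idx)

# One 0/1 indicator column per special day of the month
# (days chosen for illustration only)
for day in (1, 15, 28, 31):
    exog[f"dom_{day}"] = (idx.day == day).astype(int)
```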

Visually the final model looks slightly better, but according to RMSE and MAE it is slightly worse. I think the problem is that we simply don't have enough data points for the exogenous regressors to be estimated well. I am going to use model three because I think it follows the shape of the data best; the extra month in training should help it, and I suspect it will perform best in the real world.
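For reference, the two metrics can be computed with plain numpy. Note that RMSE is never smaller than MAE and penalises large errors more heavily, which is one reason the two can disagree about model ranking (the actual/predicted values below are just a toy example).

```python
import numpy as np

def rmse(y_true, y_pred):
    # Root mean squared error: squares errors, so large misses dominate
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    # Mean absolute error: weights all errors linearly
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

actual = [100.0, 110.0, 95.0]
pred = [102.0, 108.0, 98.0]
```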

It isn't perfect. It looks like it might be overshooting the spike at the end of the month, but given there is a slight under-forecast earlier in the month, visually the overall monthly forecast shouldn't be too far off.

Next we move on to expenditure.

Starting with direct debits.

It looks as though the spikes occur at the start of each month and in the middle, on the 15th. The exception is May, where the first spike is on the 2nd of May; this makes sense since the 1st is a Sunday, and in many countries the 1st of May is a public holiday. However, the spike in the middle of the month is on the 17th rather than the 15th. I can't think why this might be, since this is a Tuesday.

In any case I think it doesn't matter: since there are only a few dates, I can hardcode them into my training data, and for August I can use the 1st and the 15th, which are both Mondays.
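That hardcoding can be sketched as an indicator built from an explicit date list. Only a handful of dates are shown here; the May dates reflect the shift described above, and the August dates are the assumed 1st and 15th.

```python
import pandas as pd

idx = pd.date_range("2022-01-01", "2022-08-31", freq="D")

# Hardcoded spike dates (a few shown for illustration); May shifts to
# the 2nd and 17th, August uses the assumed 1st and 15th
spike_dates = pd.to_datetime([
    "2022-01-01", "2022-01-15",
    "2022-05-02", "2022-05-17",
    "2022-08-01", "2022-08-15",
])
spike_flag = pd.Series(idx.isin(spike_dates).astype(int), index=idx)
```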

The model looks very good — the best results I think we have seen so far.

Next is debit transfers

Again, the model looks pretty good.

Next, the groceries model.

Finally, the remaining expenditure.

I never got time to complete the bonus clustering task, but I have left some of my code in the document as I think it demonstrates more of my skills.

4. Clustering

Having looked at the data, I think there is a really good case for creating clusters separating out different types of customers. There are a number of reasons this could be a really useful thing to do:

- Different types of customers will have different risk profiles when it comes to making credit decisions

- Forecasts will be more accurate for customers who regularly use the platform

- Understanding different groups allows the tailoring of our service based on different needs

- Separating out inactive customers and customers who have churned will help tailor and optimise marketing strategy

Note that scaling is normally required for clustering because linear distance measures are used in the model; in this case it is not necessary as the inputs are all on the same scale.
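A minimal unscaled clustering sketch with scikit-learn's KMeans, on hypothetical per-customer features that are already on a shared scale (the cluster centres and feature meanings are invented for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
# Hypothetical per-customer features already on a shared scale
# (e.g. two kinds of monthly transaction counts), so no scaling step
features = np.vstack([
    rng.normal(loc=(5, 5), scale=1.0, size=(50, 2)),
    rng.normal(loc=(20, 2), scale=1.0, size=(50, 2)),
    rng.normal(loc=(2, 25), scale=1.0, size=(50, 2)),
])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(features)
```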

To reiterate, the clustering task isn't complete. I originally thought it might be a good idea to create clusters and redo the forecasting task for the individual groups, but alas I was limited by time.